Exploratory Data Analysis (EDA)

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing data when you do not yet have a clear hypothesis or modeling goal.
Instead of jumping directly into modeling, EDA focuses on understanding the structure, patterns, and anomalies in the data.

EDA aims to:

  • Maximize insight into the dataset
  • Uncover underlying structure
  • Identify important variables
  • Detect outliers and anomalies
  • Test assumptions for later modeling
  • Develop simpler (parsimonious) models
  • Generate hypotheses driven by data

EDA by Dimensionality

Low-dimensional data (1–3 dimensions):

  • Summary statistics (mean, median, variance)
  • Direct plotting (1D, 2D, 3D)

High-dimensional data:

  • Visualization becomes difficult
  • Dimensionality reduction techniques such as PCA are required

Data Visualization

Why Visualize Data?

Humans are exceptionally good at recognizing visual patterns.
Visualization leverages this ability to quickly detect trends, clusters, gaps, and anomalies that are hard to see in raw tables.

The limitation is scale: as the number of dimensions or data points grows, visualization becomes harder and requires careful design.

Four Primary Purposes of Visualization

  • Composition: What parts make up the whole?
  • Distribution: How are values spread?
  • Comparison: How do values differ across groups?
  • Relationship: How do variables relate to each other?

Data Summarization

Measures of Location

  • Mean:
    \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i
  • Median: Middle value (50% above, 50% below)
  • Quartiles:
    • Q1: 25% of data below
    • Q3: 75% of data below
  • Mode: Most frequent value
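The location measures above can be computed directly with NumPy; the data array below is a hypothetical example (NumPy has no built-in mode, so `collections.Counter` stands in):

```python
import numpy as np
from collections import Counter

# Hypothetical sample, sorted for readability
x = np.array([2, 4, 4, 5, 7, 9, 9, 9, 11])

mean = x.mean()                      # (1/n) * sum of x_i
median = np.median(x)                # middle value: 50% above, 50% below
q1, q3 = np.percentile(x, [25, 75])  # quartiles (linear interpolation)
mode = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value
```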

Measures of Dispersion

  • Variance:
    \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat{\mu})^2
  • Standard deviation:
    \hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\hat{\mu})^2}
  • Range: max − min
  • Interquartile range (IQR): Q3 − Q1
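A minimal sketch of the dispersion measures, on the same kind of hypothetical data (note that `numpy.var` defaults to the 1/n estimator used in the formula above):

```python
import numpy as np

x = np.array([2, 4, 4, 5, 7, 9, 9, 9, 11], dtype=float)

variance = x.var()            # (1/n) * sum of squared deviations from the mean
std = x.std()                 # square root of the variance
data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                 # interquartile range: Q3 - Q1
```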

Skew

Skew describes the asymmetry of a distribution: which side carries the long tail, and where most of the data mass lies relative to the median.

  • Negative skew: Long tail on the left, mass at higher values
  • Positive skew: Long tail on the right, mass at lower values
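Skew can be quantified as the standardized third moment; the text does not give a formula, so the sketch below uses the common population-style estimator, with two small hypothetical datasets:

```python
import numpy as np

def sample_skewness(x):
    """Standardized third moment: positive => long right tail."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()   # population std, matching the 1/n convention above
    return float(np.mean(z ** 3))

right_tailed = [1, 2, 2, 3, 10]     # hypothetical data with a long right tail
left_tailed = [10, 18, 18, 19, 20]  # hypothetical data with a long left tail
```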

Composition Visualization

Pie Charts

Pie charts show how discrete categories contribute to a whole.
They are best used when the number of categories is small and differences are large.

Stacked Bar Charts

Stacked bars are generally preferred over pie charts because they:

  • Allow easier comparison across groups
  • Show trends over time more clearly

Distribution Visualization

Histograms

Histograms visualize the distribution of a single continuous variable by dividing the range into bins and counting observations per bin.

They reveal:

  • Center (mean/median)
  • Spread
  • Skew
  • Outliers
  • Multiple modes

Histogram Limitations

Histograms can be misleading for small datasets because the shape depends heavily on bin width.
Different bin choices can lead to very different interpretations.
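The bin-width sensitivity can be seen numerically: a coarse binning of a bimodal sample merges the modes, while a finer binning separates them. The data below is a hypothetical two-cluster sample:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal sample: two Gaussian clusters at -2 and +2
data = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)])

# A few wide bins can hide the two modes; more bins reveal them
counts_coarse, _ = np.histogram(data, bins=3)
counts_fine, _ = np.histogram(data, bins=20)
```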

Kernel Density Estimation (KDE)

KDE estimates a smooth probability density function by placing a kernel around each data point.

\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right)

  • Kernel K: Shape (Gaussian, uniform, etc.)
  • Bandwidth h: Controls smoothness

Small h captures detail but may be noisy; large h smooths noise but may hide structure.


Comparison Visualization

Bar Plots

Bar plots compare values across categories or models.
They are effective for showing differences in magnitude.

Box Plots

Box plots summarize a continuous variable across discrete groups.

  • Center line: median
  • Box: first to third quartile
  • Whiskers: most extreme points within 1.5×IQR of the box (or the full range)
  • Points outside: outliers
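The box-plot summary, including the 1.5×IQR outlier rule, can be computed without plotting; a minimal sketch on hypothetical data with one extreme value:

```python
import numpy as np

def boxplot_stats(x):
    """Median, quartiles, and outliers beyond the 1.5*IQR fences."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker fences
    outliers = x[(x < lo) | (x > hi)]
    return med, q1, q3, outliers

med, q1, q3, out = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```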

Relationship Visualization

Scatter Plots

Scatter plots display relationships between two continuous variables.

They reveal:

  • Presence or absence of relationships
  • Linear vs non-linear trends
  • Outliers
  • Homoskedastic vs heteroskedastic behavior

Scatterplot Matrix

A scatterplot matrix shows pairwise relationships among many variables.
Each cell contains a scatter plot for one variable pair.

Overplotting and Jitter

When data points overlap heavily (common with integer data), patterns become hidden.
Jittering adds small random noise to reveal data density.
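Jittering is just small random noise added before plotting; a sketch on hypothetical integer ratings:

```python
import numpy as np

rng = np.random.default_rng(42)

def jitter(values, scale=0.1):
    """Add small uniform noise so overlapping integer points become visible."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-scale, scale, size=values.shape)

ratings = np.array([1, 1, 1, 2, 2, 3])  # heavily overplotted integer data
jittered = jitter(ratings)
```

The scale should be small relative to the spacing between distinct values, so the noise reveals density without suggesting spurious structure.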


Dimensionality Reduction

Why Reduce Dimensionality?

  • Simplifies modeling
  • Reduces computational cost
  • Removes redundancy
  • Reveals hidden structure

Dimensions That Can Be Dropped

  • Constant: no variation
  • Nearly constant: minimal variation
  • Linearly dependent: redundant information
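All three cases can be detected numerically: (near-)constant columns have (near-)zero variance, and linear dependence shows up as rank deficiency of the centered matrix. A sketch with a hypothetical 3-column matrix:

```python
import numpy as np

def droppable_columns(X, var_tol=1e-8):
    """Flag (nearly) constant columns and count linearly dependent ones."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                           # center each column
    near_constant = np.flatnonzero(Xc.var(axis=0) < var_tol)
    rank = np.linalg.matrix_rank(Xc)
    n_redundant = X.shape[1] - rank                   # columns carrying no new information
    return near_constant, n_redundant

X = np.array([[1.0, 5.0, 2.0],
              [2.0, 5.0, 4.0],
              [3.0, 5.0, 6.0]])   # column 1 is constant; column 2 = 2 * column 0
const_cols, n_redundant = droppable_columns(X)
```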

Goals of Dimensionality Reduction

  • High variance: preserve informative dimensions
  • Low covariance: avoid redundant dimensions

Change of Basis

The most informative directions in data are often not aligned with the original axes.
Dimensionality reduction rotates the coordinate system to align with directions of maximum variance.


Principal Component Analysis (PCA)

PCA Overview

PCA transforms an n \times p data matrix into a new representation with fewer dimensions
while preserving as much variance as possible.

PCA Steps

  1. Center data so each column has mean 0
  2. Compute the covariance matrix \Sigma = X^\top X (up to a factor of 1/n, which does not change the eigenvectors)
  3. Perform eigendecomposition: \Sigma = Q\Lambda Q^\top
  4. Use eigenvectors as new axes (principal components)
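The four steps above can be sketched directly in NumPy; the input is a hypothetical correlated 2D dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2D data
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

# 1. Center so each column has mean 0
Xc = X - X.mean(axis=0)
# 2. Covariance matrix (up to a 1/n factor, irrelevant to the eigenvectors)
Sigma = Xc.T @ Xc
# 3. Eigendecomposition (eigh: Sigma is symmetric), sorted by decreasing eigenvalue
evals, Q = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]
evals, Q = evals[order], Q[:, order]
# 4. Use the eigenvectors as new axes: project onto the principal components
Y = Xc @ Q
```

By construction the columns of Y are uncorrelated, since Y^T Y = Q^T Sigma Q = Lambda is diagonal.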

Key Mathematical Goal

Choose an orthogonal transformation A so that the columns of
Y = XA are uncorrelated, i.e. the matrix
Y^\top Y = \Lambda is diagonal.

Dimensionality Reduction with PCA

Keeping only the first m principal components (where m < p)
reduces dimensionality while retaining most variance.

Scree Plot

A scree plot shows eigenvalues versus component index.
The number of components is chosen where most variance is captured (often 80–90%).
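The scree-plot decision can be automated as "smallest m whose cumulative eigenvalue share reaches the threshold"; the eigenvalue spectrum below is a hypothetical example:

```python
import numpy as np

def components_for_variance(eigenvalues, threshold=0.9):
    """Smallest m whose first m eigenvalues capture >= threshold of total variance."""
    evals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    cum = np.cumsum(evals) / evals.sum()                         # cumulative share
    return int(np.searchsorted(cum, threshold) + 1)

# Hypothetical eigenvalue spectrum read off a scree plot
m = components_for_variance([5.0, 3.0, 1.0, 0.7, 0.3], threshold=0.9)
```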

PCA Applications

  • Image compression
  • Facial recognition (eigenfaces)
  • Finance (market factors)
  • High-dimensional visualization